Blog Post 2

Post2

ManiShankerKamarapu

Amazon Review analysis

Author

Mani Shanker Kamarapu

Published

October 2, 2022

Introduction

In this blog I plan to scrape reviews on different products in Amazon and do pre-processing of data.

Loading the libraries

library(polite)
library(rvest)
library(tidyverse)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.0      ✔ purrr   0.3.5 
✔ tibble  3.1.8      ✔ dplyr   1.0.10
✔ tidyr   1.2.1      ✔ stringr 1.4.1 
✔ readr   2.1.3      ✔ forcats 0.5.2 
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter()         masks stats::filter()
✖ readr::guess_encoding() masks rvest::guess_encoding()
✖ dplyr::lag()            masks stats::lag()

library(stringr)
library(quanteda)

Package version: 3.2.3
Unicode version: 14.0
ICU version: 70.1
Parallel computing: 8 of 8 threads used.
See https://quanteda.io for tutorials and examples.

library(tidyr)
library(RColorBrewer)
library(quanteda.textplots)
library(wordcloud)
library(wordcloud2)
library(devtools)

Loading required package: usethis

library(quanteda.dictionaries)
library(quanteda.sentiment)


Attaching package: 'quanteda.sentiment'

The following object is masked from 'package:quanteda':

    data_dictionary_LSD2015

knitr::opts_chunk$set(echo = TRUE)

Using the below code for checking if the website is scrapable or not

bow("https://www.amazon.com/")

<polite session> https://www.amazon.com/
    User-agent: polite R package
    robots.txt: 149 rules are defined for 2 bots
   Crawl delay: 5 sec
  The path is scrapable for this user-agent

Acquiring the data

The below function is used for web-scrapping reviews from Amazon. I am acquiring review title, review text and number of stars for the review. I have collected the reviews for the following books:

A Game of Thrones: A Song of Ice and Fire, Book 1
A Clash of Kings: A Song of Ice and Fire, Book 2
A Storm of Swords: A Song of Ice and Fire, Book 3
A Feast for Crows: A Song of Ice and Fire, Book 4
A Dance with Dragons: A Song of Ice and Fire, Book 5
Twilight: The Twilight Saga, Book 1
New Moon: The Twilight Saga, Book 2
Eclipse: The Twilight Saga, Book 3
Breaking Dawn: The Twilight Saga, Book 4
The Hunger Games
Catching Fire: The Hunger Games
Mockingjay: The Hunger Games, Book 3

scrape_amazon <- function(ASIN, page_num){
  
  url_reviews <- paste0("https://www.amazon.com/product-reviews/",ASIN,"/?pageNumber=",page_num)
  
  doc <- read_html(url_reviews) # Assign results to `doc`
  
  # Review Title
  doc %>% 
    html_nodes("[class='a-size-base a-link-normal review-title a-color-base review-title-content a-text-bold']") %>%
    html_text() -> review_title
  
  # Review Text
  doc %>% 
    html_nodes("[class='a-size-base review-text review-text-content']") %>%
    html_text() -> review_text
  
  # Number of stars in review
  doc %>%
    html_nodes("[data-hook='review-star-rating']") %>%
    html_text() -> review_star
  
  # Return a tibble
  tibble(review_title,
         review_text,
         review_star,
         page = page_num,
         ASIN) %>% return()
}

Using the above function I have scraped equal number of reviews for each series to compare them. I have used a for loop and sleep time of 2 seconds to avoid bot detection. Then converted the whole data into csv format.

Reading the data

reviews <- read_csv("amazonreview.csv")

New names:
Rows: 46450 Columns: 6
── Column specification
──────────────────────────────────────────────────────── Delimiter: "," chr
(4): review_title, review_text, review_star, ASIN dbl (2): ...1, page
ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
Specify the column types or set `show_col_types = FALSE` to quiet this message.
• `` -> `...1`

reviews

Data Preprocessing

I have cleaned the text of the reviews by removing punctuations, numbers, UTF symbols, &amp, digits, new line characters, single length words, Pascal case words were removed from the tweets text using the stringr library departments communicate information to alleviate specific public functions. Stopwords are removed. The stopwords collection is taken from stopwords-iso and SMART. Stemming is not preferred here as the meaning of the word is important for analysis. And then categorized the data based on the book title and series title. The analysis done in the project is based on the categorized series titles to compare sentiments and topics on basis of series.

clean_text <- function (text) {
  # Remove url
  str_remove_all(text," ?(f|ht)(tp)(s?)(://)(.*)[.|/](.*)") %>% 
    # Remove mentions
    str_remove_all("@[[:alnum:]_]*") %>% 
    # Replace "&" character reference with "and"
    str_replace_all("&amp;", "and") %>%
    # Remove punctuation
    str_remove_all("[[:punct:]]") %>%
    # remove digits
    str_remove_all("[[:digit:]]") %>%
    # Replace any newline characters with a space
    str_replace_all("\\\n|\\\r", " ") %>%
    # remove strings like "<U+0001F9F5>"
    str_remove_all("<.*?>") %>% 
    # Make everything lowercase
    str_to_lower() %>%
    # Remove any trailing white space around the text and inside a string
    str_squish()
}

Tidying the cleaned data

Dropping the NA in the cleaned text.

reviews$clean_text <- clean_text(reviews$review_text) 
reviews <- reviews %>%
  drop_na(clean_text)
reviews

Corpus of the data

text <- corpus(c(reviews$clean_text))
text <- tokens(text, remove_punct=TRUE, remove_numbers = TRUE, remove_separators = TRUE, remove_symbols = TRUE) %>% 
              tokens_select(pattern=c(stopwords("en"), "im", "didnt", "couldnt","wasnt", "id", "ive", "isnt", "dont", "wont", "shes", "doesnt"), selection="remove") %>% 
              tokens_select(pattern=stopwords("SMART"), 
                selection="remove")

Warning: 'stopwords(language = "SMART")' is deprecated.
Use 'stopwords(source = "smart")' instead.
See help("Deprecated")

text <- dfm(text)
text

Document-feature matrix of: 46,447 documents, 74,093 features (99.96% sparse) and 0 docvars.
       features
docs    love fantasy kid stories set creative worlds featuring varied groups
  text1    1      12   1       3   1        1      2         1      1      1
  text2    2       8   0       1   1        0      0         0      1      0
  text3    1       1   0       0   0        0      0         0      0      0
  text4    0       7   0       1   0        0      0         0      0      0
  text5    0      12   0       0   0        0      0         0      0      0
  text6    0       0   0       1   0        0      0         0      0      0
[ reached max_ndoc ... 46,441 more documents, reached max_nfeat ... 74,083 more features ]

sum(ntoken(text))

[1] 1682431

The total number of tokens in the text are 1682431.

summary(corpus(c(reviews$clean_text)))

Finding the frequency and rank of each word in the data

word_counts <- as.data.frame(sort(colSums(text),dec=T))
colnames(word_counts) <- c("Frequency")
word_counts$word <- row.names(word_counts)
word_counts$Rank <- c(1:ncol(text))
word_counts

text_dfm <- dfm_trim(text, min_termfreq = 50, docfreq_type = "prop")
# create fcm from dfm
text_fcm <- fcm(text_dfm)
text_fcm

Feature co-occurrence matrix of: 3,454 by 3,454 features.
            features
features      love fantasy kid stories  set creative worlds groups characters
  love       13025    2234 337    1701 1155      224    161     72      11380
  fantasy        0    1779  45     554  456       76     99     20       6374
  kid            0       0  48      23   27        6      5      5        223
  stories        0       0   0     313  190       28     33     33       2790
  set            0       0   0       0  237       26     29     11       1992
  creative       0       0   0       0    0       22      3      2        309
  worlds         0       0   0       0    0        0      8      4        261
  groups         0       0   0       0    0        0      0      2        114
  characters     0       0   0       0    0        0      0      0      15686
  fighting       0       0   0       0    0        0      0      0          0
            features
features     fighting
  love            579
  fantasy         149
  kid               6
  stories          51
  set              78
  creative         13
  worlds            5
  groups            4
  characters      723
  fighting         54
[ reached max_feat ... 3,444 more features, reached max_nfeat ... 3,444 more features ]

Finding the top features from the above feature co-occurrence matrix and ploting the network plot.

# pull the top features
top_features <- names(topfeatures(text_fcm, 50))
# retain only those top features as part of our matrix
even_text_fcm <- fcm_select(text_fcm, pattern = top_features, selection = "keep")
# check dimensions
dim(even_text_fcm)

[1] 50 50

# compute size weight for vertices in network
size <- log(colSums(even_text_fcm))
# create plot
textplot_network(even_text_fcm, vertex_size = size / max(size) * 2)

Wordcloud of the data

textplot_wordcloud(text, min_size = 1.5, max_size = 4, random_order = TRUE, max_words = 150, min_count = 50, color = brewer.pal(8, "Dark2") )

I can observe that book has the highest count and then there are interesting words like great, enjooyed, amazing, good, love which express the feelings of the people which can help in our sentimental analysis.

Further study

I will be transforming and categorizing data and also plot some analysis plots and if possible also do some sentiment analysis.

--- title: "Blog Post 2" author: "Mani Shanker Kamarapu" desription: "Acquiring and Preprocessing data" date: "10/2/2022" format: html: df-print: paged css: styles.css toc: true code-fold: false code-copy: true code-tools: true categories: - Post2 - ManiShankerKamarapu - Amazon Review analysis --- ## Introduction In this blog I plan to scrape reviews on different products in Amazon and do pre-processing of data. ## Loading the libraries ```{r} library(polite) library(rvest) library(tidyverse) library(stringr) library(quanteda) library(tidyr) library(RColorBrewer) library(quanteda.textplots) library(wordcloud) library(wordcloud2) library(devtools) library(quanteda.dictionaries) library(quanteda.sentiment) knitr::opts_chunk$set(echo = TRUE) ``` Using the below code for checking if the website is scrapable or not ```{r} bow("https://www.amazon.com/") ``` ## Acquiring the data The below function is used for web-scrapping reviews from Amazon. I am acquiring review title, review text and number of stars for the review. I have collected the reviews for the following books: - A Game of Thrones: A Song of Ice and Fire, Book 1 - A Clash of Kings: A Song of Ice and Fire, Book 2 - A Storm of Swords: A Song of Ice and Fire, Book 3 - A Feast for Crows: A Song of Ice and Fire, Book 4 - A Dance with Dragons: A Song of Ice and Fire, Book 5 - Twilight: The Twilight Saga, Book 1 - New Moon: The Twilight Saga, Book 2 - Eclipse: The Twilight Saga, Book 3 - Breaking Dawn: The Twilight Saga, Book 4 - The Hunger Games - Catching Fire: The Hunger Games - Mockingjay: The Hunger Games, Book 3 ```{r} scrape_amazon <- function(ASIN, page_num){ url_reviews <- paste0("https://www.amazon.com/product-reviews/",ASIN,"/?pageNumber=",page_num) doc <- read_html(url_reviews) # Assign results to `doc` # Review Title doc %>% html_nodes("[class='a-size-base a-link-normal review-title a-color-base review-title-content a-text-bold']") %>% html_text() -> review_title # Review Text doc %>% html_nodes("[class='a-size-base review-text review-text-content']") %>% html_text() -> review_text # Number of stars in review doc %>% html_nodes("[data-hook='review-star-rating']") %>% html_text() -> review_star # Return a tibble tibble(review_title, review_text, review_star, page = page_num, ASIN) %>% return() } ``` Using the above function I have scraped equal number of reviews for each series to compare them. I have used a for loop and sleep time of 2 seconds to avoid bot detection. Then converted the whole data into csv format.                                                                                                     ## Reading the data ```{r} reviews <- read_csv("amazonreview.csv") reviews ``` ## Data Preprocessing I have cleaned the text of the reviews by removing punctuations, numbers, UTF symbols, &amp, digits, new line characters, single length words, Pascal case words were removed from the tweets text using the `stringr` library departments communicate information to alleviate specific public functions. Stopwords are removed. The stopwords collection is taken from `stopwords-iso` and `SMART`. Stemming is not preferred here as the meaning of the word is important for analysis. And then categorized the data based on the book title and series title. The analysis done in the project is based on the categorized series titles to compare sentiments and topics on basis of series. ```{r} clean_text <- function (text) { # Remove url str_remove_all(text," ?(f|ht)(tp)(s?)(://)(.*)[.|/](.*)") %>% # Remove mentions str_remove_all("@[[:alnum:]_]*") %>% # Replace "&" character reference with "and" str_replace_all("&", "and") %>% # Remove punctuation str_remove_all("[[:punct:]]") %>% # remove digits str_remove_all("[[:digit:]]") %>% # Replace any newline characters with a space str_replace_all("\\\n|\\\r", " ") %>% # remove strings like "<U+0001F9F5>" str_remove_all("<.*?>") %>% # Make everything lowercase str_to_lower() %>% # Remove any trailing white space around the text and inside a string str_squish() } ``` ## Tidying the cleaned data Dropping the NA in the cleaned text. ```{r} reviews$clean_text <- clean_text(reviews$review_text) reviews <- reviews %>% drop_na(clean_text) reviews ``` ### Corpus of the data ```{r} text <- corpus(c(reviews$clean_text)) text <- tokens(text, remove_punct=TRUE, remove_numbers = TRUE, remove_separators = TRUE, remove_symbols = TRUE) %>% tokens_select(pattern=c(stopwords("en"), "im", "didnt", "couldnt","wasnt", "id", "ive", "isnt", "dont", "wont", "shes", "doesnt"), selection="remove") %>% tokens_select(pattern=stopwords("SMART"), selection="remove") text <- dfm(text) text ``` ```{r} sum(ntoken(text)) ``` The total number of tokens in the text are `1682431`. ```{r} summary(corpus(c(reviews$clean_text))) ``` Finding the frequency and rank of each word in the data ```{r} word_counts <- as.data.frame(sort(colSums(text),dec=T)) colnames(word_counts) <- c("Frequency") word_counts$word <- row.names(word_counts) word_counts$Rank <- c(1:ncol(text)) word_counts ``` ```{r} text_dfm <- dfm_trim(text, min_termfreq = 50, docfreq_type = "prop") # create fcm from dfm text_fcm <- fcm(text_dfm) text_fcm ``` Finding the top features from the above feature co-occurrence matrix and ploting the network plot. ```{r} # pull the top features top_features <- names(topfeatures(text_fcm, 50)) # retain only those top features as part of our matrix even_text_fcm <- fcm_select(text_fcm, pattern = top_features, selection = "keep") # check dimensions dim(even_text_fcm) # compute size weight for vertices in network size <- log(colSums(even_text_fcm)) # create plot textplot_network(even_text_fcm, vertex_size = size / max(size) * 2) ``` ## Wordcloud of the data ```{r} textplot_wordcloud(text, min_size = 1.5, max_size = 4, random_order = TRUE, max_words = 150, min_count = 50, color = brewer.pal(8, "Dark2") ) ``` I can observe that book has the highest count and then there are interesting words like great, enjooyed, amazing, good, love which express the feelings of the people which can help in our sentimental analysis. ## Further study I will be transforming and categorizing data and also plot some analysis plots and if possible also do some sentiment analysis.